METHOD AND SYSTEM FOR CONVERTING FILES
TO A SPECIFIED MARKUP LANGUAGE 

Background Of The Invention 

Field of the Invention 

This invention generally relates to information processing in a computer system, and 
more specifically, to converting data files to a specified format. Even more specifically, 
the invention relates to methods and systems particularly well suited for converting data 
to an XML format. 

Prior Art 



Extensible Markup Language (XML) is a format for storing computer data and is 
becoming increasingly popular, particularly for data that are accessed or transmitted over 
the Internet. With XML as an emerging technology, there is a need to convert legacy
data to an XML format. While data can be converted by hand, on a case-by-case basis, 
there is currently no widely applicable, generalized and automated technique to do this 
conversion. 

Summary Of The Invention

An object of this invention is to provide a simple, generalized procedure for converting 
data to a specified markup language. 

Another object of the present invention is to provide a parser that can be used on its own
or as part of a larger system to convert large amounts of data quickly to an XML format.

A further object of the invention is to provide a parser that can be used, with little or no
modification, to convert any delimited file input into any XML formatted output.

These and other objects are attained with a method and system for converting a
delimited flat file to a markup language specified by a document type definition file.
The method comprises the steps of providing a delimited flat file having columns with
headings, and providing a map file conforming to said document type definition file and
having tags and attributes including references matching said headings. A tree structure
is formed from the map file, with each tag representing one or more nodes of the tree.
The tree structure is traversed, node-by-node, and for each node, the attributes are
entered into said markup language file. When the attributes include one of said
references, text is retrieved from one of the columns with one of the matching headings
of the flat file, and that text is entered into the markup language file.

The preferred embodiment of the invention is very well suited to convert legacy data to
an XML format. If this legacy data is in the form of a delimited flat file, this parser will
automatically perform the conversion. Similarly, if data comes from an EDI transaction,
an existing converter could be used to transform the EDI document to a flat file, and this
new parser could then transform the data to an XML document. XML is quickly
becoming a common format for storing and using data which comes in through EDI.
The user needs only to create a map file (the map file is an XML file conforming to a
specific DTD), which tells the parser which pieces of information should be included in
which elements/attributes of the resulting XML document. This parser could be
incorporated into a larger system which uses the XML document. Since the map file is
itself an XML document conforming to a specific DTD, a user interface could easily be
created for writing and updating map files. It should be noted that the parser output is
not limited to a file. It could also be a string (instead of a file), which could be
transferred over the network, transformed to HTML and displayed on a browser, stored
in a database, added to a message queue, etc.

Further benefits and advantages of the invention will become apparent from a
consideration of the following detailed description, given with reference to the 
accompanying drawings, which specify and show preferred embodiments of the 
invention. 

Brief Description Of The Drawings 

Figure 1 is a block diagram of a computer workstation in which the present invention 
may be practiced. 

Figure 2 illustrates a networked computing environment in which the present invention 
may be practiced. 

Figure 3 outlines a method embodying the invention. 

Figure 4 shows the contents of a map file that may be used in this invention. 

Detailed Description Of The Preferred Embodiments 



Figure 1 illustrates a representative workstation hardware environment in which the
present invention may be practiced. This environment comprises a representative single
user computer workstation 10, such as a personal computer, including related peripheral
devices. The workstation 10 includes a microprocessor 12 and a bus 14 employed to
connect and enable communication between the microprocessor 12 and the components
of the workstation 10 in accordance with known techniques. The workstation 10
typically includes a user interface adapter 16, which connects the microprocessor 12 via
the bus 14 to one or more interface devices, such as a keyboard 18, mouse 20, and/or
other interface devices 22, which can be any user interface device, such as a touch
sensitive screen, digitized entry pad, etc. The bus 14 also connects a display device 24,
such as an LCD screen or monitor, to the microprocessor 12 via a display adapter 26.



The bus 14 also connects the microprocessor 12 to memory 28 and long-term storage 30,
which can include a hard drive, diskette drive, tape drive, etc.

The workstation 10 communicates via a communications channel 32 with other 
computers or networks of computers. The workstation 10 may be associated with such 
other computers in a local area network (LAN) or a wide area network, or the 
workstation 10 can be a client in a client/server arrangement with another computer, etc. 
All of these configurations, as well as the appropriate communications hardware and 
software, are known in the art. 

Figure 2 illustrates a data processing network 40 in which the present invention may be 
practiced. The data processing network 40 includes a plurality of individual networks, 
including LANs 42 and 44, each of which includes a plurality of individual workstations 
10. Alternatively, as those skilled in the art will appreciate, a LAN may comprise a 
plurality of intelligent workstations coupled to a host processor. 

Still referring to Figure 2, the data processing network 40 may also include multiple 
mainframe computers, such as a mainframe computer 46, which may be coupled to the 
LAN 44 by means of a communications link 48. The mainframe computer 46 may be 
implemented utilizing an Enterprise Systems Architecture/370, or an Enterprise Systems 
Architecture/390 computer available from the International Business Machines 
Corporation (IBM). Depending on the application, a midrange computer, such as an
Application System/400 (also known as an AS/400), may be employed. "Enterprise
Systems Architecture/370", "Enterprise Systems Architecture/390", "Application
System/400", and "AS/400" are registered trademarks of IBM.

The mainframe computer 46 may also be coupled to a storage device 50, which may 
serve as remote storage for the LAN 44. Similarly, the LAN 44 may be coupled to a 
communications link 52 through a subsystem control unit/communication controller 54 
and a communications link 56 to a gateway server 58. The gateway server 58 is 
preferably an individual computer or intelligent workstation which serves to link the
LAN 42 to the LAN 44. 



Those skilled in the art will appreciate that the mainframe computer 46 may be located a 
great geographic distance from the LAN 44, and similarly, the LAN 44 may be located a
substantial distance from the LAN 42. For example, the LAN 42 may be located in 
California, while the LAN 44 may be located in Texas, and the mainframe computer 46 
may be located in New York. 

Software programming code which embodies the present invention is typically accessed
by the microprocessor 12 of the workstation 10 from long-term storage media 30 of
some type, such as a CD-ROM drive or hard drive. In a client-server environment, such
software programming code may be stored with storage associated with a server. The
software programming code may be embodied on any of a variety of known media for
use with a data processing system, such as a diskette, hard drive, or CD-ROM. The code
may be distributed on such media, or may be distributed to users from the memory or
storage of one computer system over a network of some type to other computer systems
for use by users of such other systems. Alternatively, the programming code may be
embodied in the memory 28, and accessed by the microprocessor 12 using the bus 14.
The techniques and methods for embodying software programming code in memory, on
physical media, and/or distributing software code via networks are well known and will
not be further discussed herein.

The data stream resulting from use of the present invention may be stored on any of the
various media types used by the long-term storage 30, or may be sent from the
workstation 10 to another computer or workstation of the network illustrated in Figure 2
over the communications channel 32, for storage by that other computer or workstation.

As mentioned above, XML is becoming an increasingly popular format for data that is
transmitted between and accessed from computer networks, and Figure 3 shows a
conversion method, embodying this invention, that may be used to convert data into an
XML format. This method is for converting a delimited flat file to a markup language
specified by a document type definition file; and the method comprises the steps of
providing a delimited flat file having columns with headings, and providing a map file
conforming to the document type definition file and having tags and attributes including
references matching the headings. A tree structure is formed from the map file, with
each tag representing one or more nodes of the tree; and the tree structure is traversed,
node-by-node, and for each node, the attributes are entered into said markup language
file. When the attributes include one of said references, text is retrieved from one of the
columns with one of the matching headings of the flat file, and that text is entered into
the markup language file.

Also as indicated previously, the preferred embodiment of the invention is well suited to
convert legacy data to an XML format. If this legacy data is in the form of a delimited
flat file, this parser will automatically perform the conversion. Similarly, if data comes
from an EDI transaction, an existing converter could be used to transform the EDI
document to a flat file, and this new parser could then transform the data to an XML
document. XML is quickly becoming a common format for storing and using data
which comes in through EDI. The user needs only to create a map file (the map file is
an XML file conforming to a specific DTD), which tells the parser which pieces of
information should be included in which elements/attributes of the resulting XML
document. This parser could be incorporated into a larger system which uses the XML
document. Since the map file is itself an XML document conforming to a specific DTD,
a user interface could easily be created for writing and updating map files.

The parser starts with the user-created, user-specified map file. This is an XML
document conforming to a specific Document Type Definition (DTD). The parser reads
this file along with a user-specified tab-delimited file. (The code could be easily
modified to handle other delimiters.) The delimited file is parsed, and a new XML
document is created (assigned a user-specified name). The new XML document
contains the information from the delimited file in an XML format specified by the map
file. This parser could be used on its own, to convert large amounts of data quickly to an
XML format. Alternatively, it could be used as part of a larger system. For instance, if
the user needs to convert EDI data input into an XML document, this parser could be
used in conjunction with an existing EDI converter (which converts the EDI data to a
delimited file).
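
By way of illustration only, reading a tab-delimited file whose first row holds the column headings might be sketched as follows. This is a minimal sketch in Python, which is not the language of the described embodiment; the function name read_delimited and the sample data are hypothetical. The delimiter is a parameter, reflecting the remark that other delimiters could be handled with little change.

```python
import csv
import io


def read_delimited(text, delimiter="\t"):
    """Return (headings, rows); each row is a dict keyed by column heading."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)  # consuming the reader also loads the heading row
    return list(reader.fieldnames or []), rows


# Hypothetical sample data, not taken from the specification.
SAMPLE = "id\tname\tprice\n1\tWidget\t9.99\n2\tGadget\t19.50\n"
headings, rows = read_delimited(SAMPLE)
print(headings)
print(rows[0]["name"])
```

Keying each row by its heading is what later allows a reference in the map file to be resolved to a column by name rather than by position.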

The parser includes two major pieces, the parser code itself and map.dtd. The map.dtd
file is the specification for the map file, and this map file tells the parser how to format
the data in the resulting XML document. The contents of a map.dtd file are shown, by
way of example, in Figure 4.

The user must create an XML document which conforms to this DTD. The map file 
gives the following information: 



(1) the element and attribute structure for the resulting XML document.

(2) the name of the column in the flat file from which to get each piece of PCDATA and
CDATA (or the exact text to be printed if this is to be default text, i.e., the same in each
element) in the resulting XML document.

The file specified by the user must have headings on the columns which match the 
references in the map file. 
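
The requirement that column headings match the references in the map file can be checked before any conversion is attempted. The following is a minimal sketch in Python under stated assumptions: the map vocabulary used here (tags named element and attribute, each carrying an optional column reference) is hypothetical and is not the map.dtd of Figure 4, and the function names are likewise illustrative.

```python
from xml.dom import minidom

# Hypothetical map file; the tag and attribute names are assumptions.
MAP_XML = """<map>
  <element name="item">
    <attribute name="id" column="id"/>
    <element name="name" column="name"/>
    <element name="price" column="price"/>
  </element>
</map>"""


def referenced_columns(map_xml):
    """Collect every column heading that the map file refers to."""
    doc = minidom.parseString(map_xml)
    columns = set()
    for tag in ("element", "attribute"):
        for node in doc.getElementsByTagName(tag):
            if node.hasAttribute("column"):
                columns.add(node.getAttribute("column"))
    return columns


def missing_headings(map_xml, headings):
    """Return the references that have no matching heading, sorted."""
    return sorted(referenced_columns(map_xml) - set(headings))


print(missing_headings(MAP_XML, ["id", "name"]))
```

An empty result indicates that every reference in the map can be resolved against the flat file.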

The parser reads the map file into a DOM tree, for example, using IBM XML-4J version
3.0. The DOM, it may be noted, is an established recommendation of the W3C. From
the information included in the map, the parser will know where to print which element
and where to get the information to put into the elements. Therefore, no matter what
DTD and flat file the user specifies, the parser will be able to create an XML document.
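
As a rough illustration of reading a map file into a DOM tree, the sketch below uses Python's standard xml.dom.minidom module as a stand-in for the XML parser named above. The map vocabulary (element, attribute, name, column) is a hypothetical one chosen for the example, not the map.dtd of Figure 4, and the outline function is illustrative.

```python
from xml.dom import minidom, Node

# Hypothetical map file; the tag and attribute names are assumptions.
MAP_XML = """<map>
  <element name="item">
    <attribute name="id" column="id"/>
    <element name="name" column="name"/>
  </element>
</map>"""


def outline(node, depth=0):
    """Return one indented line per element node in the DOM tree."""
    lines = []
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = " ".join(
            '%s="%s"' % (k, v) for k, v in sorted(node.attributes.items())
        )
        lines.append("  " * depth + node.tagName + (" " + attrs if attrs else ""))
        for child in node.childNodes:  # whitespace text nodes are skipped
            lines.extend(outline(child, depth + 1))
    return lines


doc = minidom.parseString(MAP_XML)
print("\n".join(outline(doc.documentElement)))
```

Each tag of the map becomes a node of the tree, and the attributes on each node carry the information the parser needs to locate the corresponding column.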

A recursive method is used to move through the DOM tree of the map file, printing
element_nodes and attribute_nodes whenever they exist, pulling the text from the
specified columns in the delimited file. The parser knows how to deal with these by way
of the user's map file: the user will specify in the map which elements are to be
optional, which are to have multiple occurrence, which are nested, etc. The map.dtd file
specifies how to write a map file that describes any of these scenarios.
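
The recursive traversal might be sketched as follows. This is a simplified illustration in Python and not the parser code of the embodiment. It assumes a hypothetical map vocabulary (element and attribute tags with name, column, text and optional attributes), emits one top-level output element per row of the flat file, and places any text before child elements. Multiple-occurrence elements, doctypes and processing instructions are not handled.

```python
import csv
import io
from xml.dom import minidom, Node

# Hypothetical map file and flat file; all names here are assumptions.
MAP_XML = """<map>
  <element name="item">
    <attribute name="id" column="id"/>
    <element name="name" column="name"/>
    <element name="currency" text="USD"/>
    <element name="note" column="note" optional="yes"/>
  </element>
</map>"""
FLAT = "id\tname\tnote\n1\tWidget\t\n2\tGadget\tfragile\n"


def build(map_node, row, out_doc):
    """Recursively build the output element described by one map node."""
    if map_node.hasAttribute("column"):
        text = row.get(map_node.getAttribute("column")) or ""
    else:
        text = map_node.getAttribute("text")  # default text, "" if absent
    if map_node.getAttribute("optional") == "yes" and not text:
        return None  # optional element with nothing to print is skipped
    out = out_doc.createElement(map_node.getAttribute("name"))
    if text:
        out.appendChild(out_doc.createTextNode(text))
    for child in map_node.childNodes:
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        if child.tagName == "attribute":
            value = row.get(child.getAttribute("column")) or ""
            out.setAttribute(child.getAttribute("name"), value)
        elif child.tagName == "element":
            built = build(child, row, out_doc)
            if built is not None:
                out.appendChild(built)
    return out


def convert(map_xml, flat_text, root_name="items"):
    """Apply the map to every row of the tab-delimited text."""
    map_doc = minidom.parseString(map_xml)
    out_doc = minidom.Document()
    root = out_doc.appendChild(out_doc.createElement(root_name))
    top = [n for n in map_doc.documentElement.childNodes
           if n.nodeType == Node.ELEMENT_NODE and n.tagName == "element"]
    for row in csv.DictReader(io.StringIO(flat_text), delimiter="\t"):
        for map_node in top:
            built = build(map_node, row, out_doc)
            if built is not None:
                root.appendChild(built)
    return out_doc


result = convert(MAP_XML, FLAT)
print(result.documentElement.toxml())
```

Because the structure of the output is driven entirely by the map, the same traversal code serves any flat file and any target format that the map can describe.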

The parser can deal with optional elements, multiple occurrence elements, nested
elements, PCDATA anywhere within an element (including after a child element),
attributes, doctypes, and processing instructions.

The resulting document will be either printed to a file or stored as a string, depending on
the user specification. The parser output is thus not limited to a file. When it is a
string, it could be transferred over the network, transformed to HTML and displayed on
a browser, stored in a database, added to a message queue, etc.
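
The choice between file output and string output might be sketched as follows. This is an illustrative Python sketch using the standard xml.etree.ElementTree module; the function name emit and its path parameter are hypothetical and do not describe the interface of the embodiment.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def emit(root, path=None):
    """Write the document to a file if a path is given, else return a string."""
    if path is None:
        return ET.tostring(root, encoding="unicode")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return None


# Hypothetical output document built by hand for the example.
root = ET.Element("items")
ET.SubElement(root, "item", id="1").text = "Widget"

as_string = emit(root)  # suitable for a network transfer or a queue
print(as_string)

out_path = os.path.join(tempfile.mkdtemp(), "out.xml")
emit(root, out_path)  # same document written to a user-specified file
```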

The parser of this invention is very general. It can be used (with little or no
modification) for any delimited file input and any XML formatted output. The formats 
of the input and output files need only to be specified in a relatively simple map file 
conforming to the map DTD included with the parser. 

While it is apparent that the invention herein disclosed is well calculated to fulfill the
objects stated above, it will be appreciated that numerous modifications and 
embodiments may be devised by those skilled in the art, and it is intended that the 
appended claims cover all such modifications and embodiments as fall within the true 
spirit and scope of the present invention. 